(57) Abstract: The invention relates to indexing of digitised entities (E) in a large and comparatively unstructured data collection
(510), for instance the Internet, such that text-based searches (S; S') with respect to the data collection (510) can be ordered (Q) via a
user client terminal (560). Index information (I_E) is generated (522) for each digitised entity (E), which contains distinctive features
({K}) being ranked according to a rank parameter. The rank parameter indicates a degree of relevance of a particular distinctive feature
(K) with respect to a given digitised entity (E) and is derived from fields or tags associated with one or more copies of the digitised
entity (E) in the data collection (510). The index information (I_E) is stored in a searchable database (530), which is accessible via
a user client interface (550) and a search engine (540). The derived distinctive features (K) and the rank parameter thus provide a
possibility to carry out text-based searches (Q) in respect of non-text digitised entities (E), such as images, audio files and video
sequences, and obtain a highly relevant search result ({H}).

Indexing of Digitised Entities

THE BACKGROUND OF THE INVENTION AND PRIOR ART 

The present invention relates generally to indexing of digitised
entities in a large and comparatively unstructured data collection
such that a relevant search result can be obtained. More
particularly the invention relates to a method of indexing
digitised entities, such as images, video or audio files, according
to the preamble of claim 1. The invention also relates to a
computer program according to claim 13, a computer readable
medium according to claim 14, a database according to claim 15
and a server/client system according to the preamble of claim
16.

Search engines and index databases for automatically finding
information in digitised text banks have been known for
decades. In recent years the rapid growth of the Internet has
intensified the development in this area. Consequently, there
are today many examples of very competent tools for finding
text information in large and comparatively unstructured data
collections or networks, such as the Internet.

As the use of the Internet has spread to a widened group of
users, the content of web pages and other resources has
diversified to include not only text, but also other types of
digitised entities, like graphs, images, video sequences, audio
sequences and various other types of graphical or acoustic files.
An exceptionally wide range of data formats may represent
these files. However, they all have one feature in common,
namely that they per se lack text information. Naturally, this fact
renders a text search for the information difficult. Various
attempts to solve this problem have nevertheless already been
made.

For instance, the US patent 6,084,595 describes an indexing
method for generating a searchable database from images, such
that an image search engine can find content-based information
in images, which match a user's search query. Feature vectors
are extracted from visual data in the images. Primitives, such as
colour, texture and shape constitute parameters that can be
distilled from the images. A feature vector is based on at least
one such primitive. The feature vectors associated with the
images are then stored in a feature database. When a query is
submitted to the search engine, a query feature vector will be
specified, as well as a distance threshold indicating the
maximum distance that is of interest for the query. All images
having feature vectors within that distance will be identified by
the query. Additional information is computed from the feature
vector being associated with each image, which can be used as
a search index.

An alternative image search and retrieval system is disclosed in
the international patent application WO99/22318. The system
includes a search engine, which is coupled to an image analyser
that in turn has access to a storage device. Feature modules
define particular regions of an image and measurements to
make on pixels within the defined region as well as any
neighbouring regions. The feature modules thus specify
parameters and characteristics which are important in a
particular image match/search routine. As a result, a relatively
rapid comparison of images is made possible.

The international patent application WO00/33575 describes a
search engine for video and graphics. The document proposes
the creation and storage of identifiers by searching an area
within a web page near a graphic file or a video file for
searchable identification terms. Areas on web pages near links
to graphic or video files are also searched for such identification
terms. The identification terms found are then stored in a
database with references to the corresponding graphic and
video files. A user can find graphic or video files by performing a
search in the database.

However, the search result will, in general, still not be of
sufficiently high quality, because the identification terms are not
accurate enough. Hence, relevant files may either end up
comparatively far down in the hit list or be missed completely in
the search.



SUMMARY OF THE INVENTION 

It is therefore an object of the present invention to alleviate the
problem above and thus provide an improved solution for finding
relevant digitised entities, such as images, video files or audio
files, by means of an automatic search being performed with
respect to a large and relatively unstructured data collection,
such as the Internet.

According to one aspect of the invention the object is achieved
by a method of indexing digitised entities as initially described,
which is characterised by generating index information for a
particular digitised entity on the basis of at least one rank
parameter. The rank parameter is derived from basic information,
which in turn pertains to at least one distinctive feature
and at least one locator for each of the digitised entities. The
rank parameter indicates a degree of relevance for at least one
distinctive feature with respect to each digitised entity.

According to another aspect of the invention the object is
achieved by a computer program directly loadable into the
internal memory of a digital computer, comprising software for
controlling the method described in the above paragraph when
said program is run on a computer.

According to yet another aspect of the invention the object
is achieved by a computer readable medium, having a program
recorded thereon, where the program is to make a computer
perform the method described in the penultimate paragraph
above.

According to an additional aspect of the invention the object is
achieved by a database for storing index information relating to
digitised entities, which index information has been generated
according to the proposed method.

According to yet an additional aspect of the invention the object
is achieved by a server/client system for searching for digitised
entities in a data collection as initially described, which is
characterised in that an index database in the server/client
system is organised such that index information contained
therein for a particular digitised entity comprises at least one
rank parameter. The rank parameter is indicative of a degree of
relevance for at least one distinctive feature with respect to the
digitised entity.

The invention provides an efficient tool for finding highly
relevant non-text material on the Internet by means of a search
query formulated in textual terms. An advantage offered by the
invention is that the web pages, or corresponding resources,
where the material is located need not contain any text
information to generate a hit.

This is an especially desired feature, in comparison to the
known solutions, since in many cases the non-text material may
be accompanied by rather laconic, but counter-intuitive text
portions.

A particular signature for each unique digitised entity utilised in
the solution according to the invention makes it possible to
eliminate any duplicate copies of digitised entities in a hit list
obtained by a search. Naturally, this further enhances the
search quality.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention is now to be explained more closely by
means of preferred embodiments, which are disclosed as
examples, and with reference to the attached drawings.

Figure 1 illustrates the generation of a first rank component in
         a proposed rank parameter according to an embodiment
         of the invention,

Figure 2 illustrates an enhancement of the first rank
         component according to an embodiment of the
         invention,

Figure 3 illustrates the generation of a second rank component
         in the proposed rank parameter according to
         an embodiment of the invention,

Figure 4 demonstrates an exemplary structure of a search
         result according to an embodiment of the invention,

Figure 5 shows a block diagram of a server/client system
         according to an embodiment of the invention, and

Figure 6 illustrates, by means of a flow diagram, an
         embodiment of the method according to the
         invention.



DESCRIPTION OF PREFERRED EMBODIMENTS OF THE 
INVENTION 

The invention aims at enhancing the relevancy of any distinctive
features, for instance keywords, being related to digitised
entities and thereby improving the chances of finding relevant
entities in future searches. In order to achieve this objective, at
least one rank parameter is allocated to each distinctive feature
that is related to a digitised entity. The embodiment of the
invention described below refers to digitised entities in the form
of images. However, the digitised entities may equally well
include other types of entities that are possible to identify
uniquely, such as audio files or video sequences. Moreover, the
digitised entities may either constitute sampled representations
of analogue signals or be purely computer-generated entities.

Figure 1 shows four copies c_a - c_d of one and the same image n
that are stored at different locations in a data collection, for
instance in an internetwork, like the Internet. The identity of the
image n can be assessed by means of a so-called image
signature, which may be determined from a total sum of all pixel
values contained in the image. A corresponding identity may, of
course, be assessed also for an audio file or a video file.
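
As a hedged illustration of the signature idea, the sketch below sums all pixel values of an image held as a list of rows. The function name `image_signature` and the nested-list representation are assumptions made for this example only. A plain sum can collide for different images, so a practical system would probably combine it with further properties such as the image dimensions.

```python
def image_signature(pixels):
    """Return the total sum of all pixel values (hypothetical helper)."""
    return sum(sum(row) for row in pixels)


# Two copies of the same image stored at different locations, plus another image.
copy_a = [[10, 20], [30, 40]]
copy_b = [[10, 20], [30, 40]]
other = [[1, 2], [3, 5]]

same_image = image_signature(copy_a) == image_signature(copy_b)
```

Copies whose signatures agree would be placed in the same cluster.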

The copies c_a - c_d of the image n are logically grouped together
in a cluster C_n. Each copy c_a - c_d is presumed to be associated
with at least one distinctive feature in the form of a keyword.
Typically, the keywords are data that are not necessarily
shown jointly with the image. On the contrary, the keywords may
be collected from data fields which are normally hidden to a
visitor of a certain web page. Thus, the keywords may be taken
from HTML-tags such as Meta, Img or Title (HTML = HyperText
Mark-up Language).
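
The collection of keywords from such tags can be sketched as follows. This is a minimal example using Python's standard `html.parser` module; the class `KeywordCollector`, the helper `collect_keywords`, the choice of the `keywords` meta name and the `alt` attribute, and the sample page are all assumptions for illustration and are not prescribed by the specification.

```python
from html.parser import HTMLParser


class KeywordCollector(HTMLParser):
    """Collects candidate keywords from Title, Meta and Img tags (illustrative)."""

    def __init__(self):
        super().__init__()
        self.keywords = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            content = attrs.get("content") or ""
            self.keywords += [w.strip().lower() for w in content.split(",") if w.strip()]
        elif tag == "img" and attrs.get("alt"):
            self.keywords += attrs["alt"].lower().split()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.keywords += data.lower().split()


def collect_keywords(html):
    parser = KeywordCollector()
    parser.feed(html)
    parser.close()
    return parser.keywords


page = ('<html><head><title>Red Car</title>'
        '<meta name="keywords" content="ferrari, sports car"></head>'
        '<body><img src="f550.jpg" alt="Ferrari 550"></body></html>')
```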

In this example a first copy c_a of the image n is associated with
the keywords k_1, k_2, k_3, k_4 up to k_ja, a second copy c_b is
associated with the keywords k_3, k_4, k_7, k_19 up to k_jb, a third copy
c_c is associated with the keywords k_1, k_3, k_4, k_5 up to k_jc, and a
fourth copy c_d is associated with the keywords k_2, k_4, k_9, k_12 up
to k_jd. In order to determine the relevance of a particular
keyword, say k_3, with respect to the image n a first rank
component Γ_n(k_3) is calculated according to the expression:

$$\Gamma_n(k_3) = \frac{\sum_i k_{i3}}{|C_n|}$$

where Σ_i k_i3 represents a sum of all occurrences of the keyword
k_3 in the cluster C_n and |C_n| denotes a total number of
keywords in the cluster C_n, i.e. the sum of unique keywords plus
any copies of the same.
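
The ratio described above (occurrences of a keyword in the cluster divided by the total number of keywords in the cluster) can be tried out with a small sketch. The keyword lists are truncated to the four keywords named per copy in the text, which is an assumption, since the text lets each list run on to k_ja, k_jb and so forth; `basic_rank` is a hypothetical helper name.

```python
from fractions import Fraction


def basic_rank(cluster, keyword):
    """Occurrences of `keyword` in the cluster over all keywords in the cluster."""
    occurrences = sum(copy.count(keyword) for copy in cluster)
    total = sum(len(copy) for copy in cluster)
    return Fraction(occurrences, total)


# Copies c_a - c_d of image n, each with only its first four keywords.
cluster_n = [
    ["k1", "k2", "k3", "k4"],    # c_a
    ["k3", "k4", "k7", "k19"],   # c_b
    ["k1", "k3", "k4", "k5"],    # c_c
    ["k2", "k4", "k9", "k12"],   # c_d
]

rank_k3 = basic_rank(cluster_n, "k3")
```

With these truncated lists, k_3 occurs three times among sixteen keywords.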

However, it is also quite common that a particular keyword, for
instance k_3, is associated with many different images. This is
illustrated in figure 2. Here, a first cluster C_1 contains nine
copies c_11 - c_19 of a first image n_1, a second cluster C_2 contains
four copies c_21 - c_24 of a second image n_2 and a third cluster C_3
contains one copy c_31 of a third image n_3. The keyword k_3 occurs
twice (affiliated with c_11 and c_12) in the first cluster C_1, three
times (affiliated with c_21, c_22 and c_24) in the second cluster C_2
and once (affiliated with c_31) in the third cluster C_3. The copy c_12
occurs twice in the first cluster C_1, on one hand associated with
the keyword k_3 and on the other hand associated with a different
keyword. In both cases, however, it is the same image.

The first rank component Γ for the keyword k_3 may now be
improved by means of a figure reflecting the strength in linkage
between the keyword k_3 and the images n_1 - n_3 (or clusters C_1 -
C_3) to which it has been associated. The keyword k_3 appears to
have its strongest link to the second image n_2, since it is
associated with the largest number of copies of this image,
namely c_21, c_22 and c_24. Correspondingly, the keyword k_3 has a
second strongest link to the first image n_1 (where it occurs in
two out of nine copies), and a third strongest link to the third
image n_3. A normalisation with respect to the largest cluster (i.e.
the cluster which includes the most copies) may be used to
model this aspect. In this example, the largest cluster C_1
includes nine copies c_11 - c_19. Therefore, a normalisation of the
keyword k_3 with respect to the images n_1 - n_3 is obtained by
multiplying the first rank component Γ_n(k_3) with the respective
number of occurrences in each cluster C_1 - C_3 divided by nine.
Of course, the general expression becomes:

$$\Gamma_n(k_j) = \frac{\sum_i k_{ij}}{|C_n|} \cdot \frac{|C_n|}{|C_{max}|} = \frac{\sum_i k_{ij}}{|C_{max}|}$$

where |C_max| is the largest number of keywords in a cluster for
any image that includes the relevant keyword k_j, for instance k_3.

The first rank component Γ is made more usable for automated
processing if it is also normalised, such that the highest first
rank component Γ for a particular keyword is equal to 1. This is
accomplished by dividing the expression above with the
following denominator:

$$\Bigl(\sum_i k_{ij}\Bigr)_{max,k_j}$$

where (Σ_i k_ij)_max,k_j denotes the number of occurrences of the
keyword k_j in the cluster, which includes most occurrences of
this keyword k_j. For instance, (Σ_i k_i3)_max,k_3 is equal to 3 in the
present example, since the keyword k_3 occurs most times in the
second cluster C_2, namely three times.

Hence, the first rank component Γ_n(k_j) for an image n with
respect to keyword k_j is preferably modelled by the simplified
expression:

$$\Gamma_n(k_j) = \frac{\sum_i k_{ij}}{\bigl(\sum_i k_{ij}\bigr)_{max,k_j}}$$

where Σ_i k_ij represents the sum of all occurrences of the
keyword k_j in the cluster C_n and (Σ_i k_ij)_max,k_j is the number of
occurrences of the keyword k_j in the cluster, which includes
most occurrences of this keyword k_j.

However, in order to improve the search performance in a
database containing indexed elements referring to the digitised
entities, it is preferable to build an inverted index on keywords,
such that a set of first rank components Γ is instead expressed
for each keyword k_j. Thus, according to a preferred embodiment
of the invention, the format of the first rank component is k_j:{Γ_n}.
Consequently, the keyword k_3 in the example above obtains the
following set of first rank components:

k_3: {Γ_2 = 1; Γ_1 = 2/3; Γ_3 = 1/3}
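
The set of first rank components for k_3 can be reproduced with a short sketch. It takes the occurrence counts from the figure 2 example (two in C_1, three in C_2, one in C_3) and divides each by the largest count. The function name `normalised_first_rank` and the dictionary layout of the inverted index are assumptions for illustration.

```python
from fractions import Fraction


def normalised_first_rank(occurrences):
    """Map {cluster: count of the keyword} to ranks, highest rank first."""
    peak = max(occurrences.values())
    ranks = {cluster: Fraction(count, peak) for cluster, count in occurrences.items()}
    return dict(sorted(ranks.items(), key=lambda item: item[1], reverse=True))


# Occurrences of keyword k_3 per cluster, as in the figure 2 example.
k3_occurrences = {"C1": 2, "C2": 3, "C3": 1}

# Inverted index on keywords: keyword -> ranked set of clusters (images).
inverted_index = {"k3": normalised_first_rank(k3_occurrences)}
```

The first entry for the keyword is always equal to 1, which matches the normalisation described above.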

The first rank component Γ_n(k_j) itself constitutes a fair reflection
of the relevancy of a keyword k_j with respect to the image n.
However, a more accurate figure can be obtained by combining
the first rank component Γ_n(k_j) with a proposed second rank
component Π_n(k_j), which will be described below.

Figure 3 illustrates how the second rank component Π_n(k_j) may
be generated according to an embodiment of the invention.

A digitised entity, e.g. an image 301, is presumed to be
associated with distinctive features k_1, k_2 and k_3, for instance in
the form of keywords, which are found at various positions P in
a descriptive field F. Each distinctive feature k_1 - k_3 is estimated
to have a relevance with respect to the digitised entity 301 that
depends on the position P in the descriptive field F in which it is
found. A weight factor w_1 - w_p for each position 1 - p in the
descriptive field F reflects this. In the illustrated example, a first
distinctive feature k_1 and a second distinctive feature k_2 are
located in a position 1 in the descriptive field F. Both the
distinctive feature k_1 and the distinctive feature k_2 occur a
number η_1 times in this position. There are no distinctive
features in a second position 2. However, various distinctive
features may be located in following positions 3 to p-2 (not
shown). The field F contains η_2 copies of the first distinctive
feature k_1 in a position p-1, and copies of the second
distinctive feature k_2 respectively η_3 copies of a third distinctive
feature k_3 in a position p.

Hence, depending on the position 1 - p in which a certain
distinctive feature k_1 - k_3 is found, the distinctive feature k_1 - k_3
is allocated a particular weight factor w_1 - w_p. Furthermore, a
relevancy parameter s_1 - s_4 is determined for every distinctive
feature k_1 - k_3, which depends on how many times η_1 - η_3 the
distinctive feature k_1 - k_3 occurs in a position 1 - p relative to a
total number of distinctive features in this position 1 - p.

Thus, both the first distinctive feature k_1 and the second
distinctive feature k_2 obtain the same relevancy parameter s_1,
which can be calculated as s_1 = η_1/(2η_1) = 1/2 in the position 1.
This parameter s_1 is further weighted with a weight factor w_1 in
respect of the digitised entity 301. The same calculations are
performed for all the positions 2 - p to obtain corresponding
relevancy parameters s_1 - s_4 for these positions.
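
The calculation for position 1 can be sketched as a count of each distinctive feature divided by the total number of features found in that position. The helper name `relevancy` and the concrete count of three occurrences per feature are assumptions; the text only states that both features occur equally often.

```python
from collections import Counter
from fractions import Fraction


def relevancy(features_at_position):
    """Share of each distinctive feature among all features in one position."""
    counts = Counter(features_at_position)
    total = sum(counts.values())
    return {feature: Fraction(count, total) for feature, count in counts.items()}


# Position 1: k_1 and k_2 each occur the same number of times (3 is arbitrary).
position_1 = ["k1"] * 3 + ["k2"] * 3

s_position_1 = relevancy(position_1)
```

Both features obtain the relevancy 1/2, independently of the chosen count.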

Alternatively, the relevancy parameter s_P can be determined as
s_P(k_j) = 1 - γ Σ_{i≠j} k_i, where γ Σ k_i represents a "penalty" that
decreases the relevancy for a distinctive feature k_j in a position
P, for each distinctive feature in this position, which is different
from the distinctive feature k_j. Naturally, other alternative
formulas for determining the relevancy parameter s_P are also
conceivable.

Nevertheless, a combined measure is determined, which fully
captures the relationship between distinctive features k_j and
digitised entities n. The expression:

$$\Pi_n(k_j) = \sum_i w_i \, s_{i,j}$$

constitutes a reflection of the relevance of a distinctive feature k_j
with respect to a particular digitised entity n. The variable w_i
denotes the weight factor for a position i and the variable s_i,j
denotes the relevancy parameter for a distinctive feature k_j in
the position i.

In analogy with the first rank component Γ, it is preferable also
to normalise and build an inverted index on keywords. The
second rank component Π is thus given a format k_j:{Π_n}, where
the first component Π_n for a particular distinctive feature k_j is
always equal to 1.

Table 1 below shows an explicit example of weight factors w_P
for certain positions P in a descriptive field F related to an
image.

Position (P)   Field (F)                 Weight factor (w_P)
 1             pageSite                   50
 2             pageDir                    40
 3             pageName                   50
 4             pageTitle                  80
 5             pageDescription            90
 6             pageKeywords               90
 7             pageText                   20
 8             imageSite                  50
 9             imageDir                   60
10             imageName                 100
11             imageAlt                  100
12             imageAnchor                80
13             imageCenterCaption         90
14             imageCellCaption           90
15             imageParagraphCaption      90

Table 1
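
As a hedged sketch of how such weight factors might be applied, the example below assumes that the second rank component of a keyword is the sum, over all positions, of the weight factor multiplied by the keyword's relevancy parameter in that position. That weighted-sum form, the function name `second_rank` and the sample relevancy values are assumptions; only the four weights are taken from Table 1.

```python
# Weight factors for a few positions, taken from Table 1.
WEIGHTS = {"pageTitle": 80, "pageText": 20, "imageName": 100, "imageAlt": 100}


def second_rank(relevancy_by_field, weights=WEIGHTS):
    """Sum weight * relevancy over all fields, per keyword (assumed form)."""
    scores = {}
    for field, relevancies in relevancy_by_field.items():
        for keyword, s in relevancies.items():
            scores[keyword] = scores.get(keyword, 0.0) + weights[field] * s
    return scores


# Hypothetical relevancy parameters for one image.
example = {
    "imageName": {"ferrari": 0.5, "550": 0.5},
    "pageTitle": {"ferrari": 1.0},
    "pageText": {"car": 1.0},
}

scores = second_rank(example)
```

A keyword that appears in a heavily weighted field such as imageName contributes more than one that only appears in the running page text.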



According to an embodiment of the invention, the second rank
component Π_n(k_j) is used as an alternative to the first rank
component Γ_n(k_j). The second rank component Π_n(k_j) is namely
also in itself a good descriptor of the relevancy of a keyword k_j
with respect to the image n.

In a preferred embodiment of the invention, however, the first
rank component Γ and the second rank component Π are
merged into a combined rank parameter Δ according to the
expression:

$$\Delta = \sqrt{\frac{(\alpha\Gamma)^2 + (\beta\Pi)^2}{\alpha^2 + \beta^2}}$$

where α is a first merge factor and β is a second merge factor.
For instance, 0 < α < 1 and 0 < β < 1. However, any other range of
the merge factors α; β is likewise conceivable.
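
A minimal numerical sketch of the merge follows, assuming the combined rank parameter is the square root of the sum of the squared, weighted rank components divided by the sum of the squared merge factors. The function name `combined_rank` and the default merge factors of 0.5 are illustrative choices, not values from the specification.

```python
import math


def combined_rank(first, second, alpha=0.5, beta=0.5):
    """Merge a first and a second rank component into one rank parameter."""
    numerator = (alpha * first) ** 2 + (beta * second) ** 2
    return math.sqrt(numerator / (alpha ** 2 + beta ** 2))


both_maximal = combined_rank(1.0, 1.0)
only_first = combined_rank(1.0, 0.0, alpha=0.6, beta=0.8)
```

Dividing by the squared merge factors keeps the result within the range of the inputs: two components equal to 1 merge to 1, whatever merge factors are chosen.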

Finally and in similarity with the first and second rank
components Γ and Π respectively, it is preferable to normalise
and build an inverted index on keywords, such that it obtains a
format k_j:{Δ_n}, where the first component Δ_n for a particular
distinctive feature k_j is always equal to 1.

When all, or at least a sufficiently large portion, of the digitised
entities in the data collection have been related to at least one
distinctive feature and a corresponding rank component/
parameter (Γ, Π or Δ), an index database is created, which also
at least includes a field for identifying the respective digitised
entity and a field containing one or more locators that indicate
where the digitised entity can be retrieved. Moreover, it is
preferable if the index database contains an intuitive
representation of the digitised entity. If the digitised entity is an
image, a thumbnail picture constitutes a suitable representation.
If, however, the digitised entity is an audio file or multimedia
file, other representations might prove more useful, for instance
in the form of logotypes or similar symbols.
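
One possible shape for a record in such an index database is sketched below, together with an inverted index built from a list of records. The class name `IndexRecord`, its field names, the sample identifiers and the example.com locators are hypothetical; the text only requires an identity field, ranked distinctive features, one or more locators and preferably a representation.

```python
from dataclasses import dataclass


@dataclass
class IndexRecord:
    entity_id: str             # identity of the digitised entity, e.g. its signature
    ranked_features: dict      # {keyword: rank parameter}
    locators: list             # where copies of the entity can be retrieved
    representation: str = ""   # e.g. a reference to a thumbnail picture


def build_inverted_index(records):
    """Return {keyword: [(rank, entity_id), ...]} with the best rank first."""
    index = {}
    for record in records:
        for keyword, rank in record.ranked_features.items():
            index.setdefault(keyword, []).append((rank, record.entity_id))
    for postings in index.values():
        postings.sort(reverse=True)
    return index


records = [
    IndexRecord("img-1", {"ferrari": 0.5}, ["http://example.com/a.jpg"], "thumb-1"),
    IndexRecord("img-2", {"ferrari": 1.0, "550": 1.0}, ["http://example.com/b.jpg"]),
]
index = build_inverted_index(records)
```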

Figure 4 demonstrates an exemplary structure of a search result
according to an embodiment of the invention. The search result
is listed in a table 400, where a first column E contains the
identity ID_1 - ID_m of the entities that matched the search criteria
sufficiently well. A second column K contains an inventory of
ranked distinctive features Δ(k_1) - Δ(k_23) for each digitised entity.
A third column R includes a characterising representation (or an
illustrative element) r_1 - r_m of the entity and a fourth column L
contains at least one locator l_1 - l_m to a corresponding "full
version" of the entity. In case the data collection is an
internetwork the locator l_1 - l_m is typically a URL (Uniform
Resource Locator). However, any other type of address is
equally well conceivable.

Naturally, the search result structure may also include arbitrary
additional fields. A reduced set of fields may then be presented
to a user. It could, for instance, be sufficient to display only the
representation r_1 - r_m and/or a limited number of the distinctive
features, with or without their respective ranking.

Figure 5 shows a block diagram of a server/client system
according to an embodiment of the invention, through which
data may be indexed, searched and retrieved. Digitised
entities are stored in a large and rather unstructured data
collection 510, for instance the Internet. An indexing input
device 520 gathers information ID_n, {K}, L from the data
collection 510 with respect to digitised entities contained
therein. The information ID_n, {K}, L includes at least an identity
field ID_n that uniquely defines the digitised entity E, a set of
distinctive features {K} and a locator L. Additional data, such as
file size and file type may also be gathered by the indexing input
device 520. It is irrelevant exactly how the information ID_n, {K},
L is entered into the indexing input device 520. However,
according to a preferred embodiment of the invention, an
automatic data collector 521, for instance in the form of a web
crawler, in the indexing input device 520 regularly accumulates
the information ID_n, {K}, L as soon as possible after addition
of new items or after updating of already stored items. An index
generator 522 in the indexing input device 520 creates index
information I_E on the basis of the information ID_n, {K}, L according to
the methods disclosed above. An index database 530 stores the
index information I_E in a searchable format, which is at least
adapted to the operation of a search engine 540.

One or more user client terminals 560 are offered a search
interface towards the index information I_E in the index database
530 via a user client interface 550. A user may thus enter a
query phrase Q, for instance, orally via a voice recognition
interface or by typing, via a user client terminal 560. Preferably,
however not necessarily, the user client interface 550
re-formulates the query Q into a search directive, e.g. in the form
of a search string S, which is adapted to the working principle of
the search engine 540. The search engine 540 receives the
search directive S and performs a corresponding search S' in
the index database 530.

Any records in the database 530 that match the search
directive S sufficiently well are sorted out and returned as a hit
list {H} of digitised entities E to the user client interface 550. If
necessary, the user client interface 550 re-formats the hit list
{H} into a search result having a structure H(R, L), which is
better suited for human perception and/or adapted to the user
client terminal 560. The hit list {H} preferably has the general
structure shown in figure 4. However, the search result H(R, L)
presented via the user client terminal 560 may have any other
structure that is found appropriate for the specific application. If
the query phrase Q comprises more than one search term (or
distinctive feature), the search result H(R, L) has proven to
demonstrate a desirable format when each search term in the hit
list {H} is normalised before presentation to the user, such that a
first combined rank parameter Δ_n for each search term is equal
to 1. For instance, a hit list {H} resulting from a search query Q
= "ferrari 550" is normalised such that the first combined rank
parameter Δ_n = 1 both with respect to "ferrari" and with respect
to "550". Any additional combined rank parameters Δ_m for the
respective search terms may, of course, have arbitrary lower
values depending on the result of the search.
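
The per-term normalisation of a hit list can be sketched as follows: for every search term, the rank parameters of all hits are divided by the largest one found for that term. The function name `normalise_hits`, the dictionary layout and the sample rank values are assumptions made for illustration.

```python
def normalise_hits(hits, terms):
    """Scale the ranks of each search term so that the best hit gets rank 1."""
    result = {entity: dict(ranks) for entity, ranks in hits.items()}
    for term in terms:
        peak = max((ranks.get(term, 0.0) for ranks in hits.values()), default=0.0)
        if peak > 0:
            for ranks in result.values():
                if term in ranks:
                    ranks[term] = ranks[term] / peak
    return result


# Hypothetical combined rank parameters for a two-term query.
hits = {
    "img-1": {"ferrari": 0.8, "550": 0.4},
    "img-2": {"ferrari": 0.4, "550": 0.2},
}
normalised = normalise_hits(hits, ["ferrari", "550"])
```

The original hit list is left untouched; the function returns a normalised copy for presentation.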



WO 02/073463                                        PCT/SE02/00462



The signature associated with each unique digitised entity
makes it possible to eliminate any duplicate copies of digitised
entities in the search result H(R, L). Such elimination produces
a search result H(R, L) of very high quality and relevance.
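
Duplicate elimination by signature might look like the sketch below, where hits that share a signature are collapsed into one entry that keeps all locators. The function name `deduplicate`, the tuple layout and the example.com/example.org locators are assumptions for illustration.

```python
def deduplicate(hits):
    """Collapse hits with the same signature, keeping first-seen order."""
    grouped = {}
    for signature, locator in hits:
        grouped.setdefault(signature, []).append(locator)
    return list(grouped.items())


# Hit list sorted by relevance; the first and third hits are copies of one image.
hits = [
    (100, "http://example.com/a.jpg"),
    (11, "http://example.com/b.jpg"),
    (100, "http://example.org/copy-of-a.jpg"),
]
unique_hits = deduplicate(hits)
```

Each unique entity then appears once in the result, while every storage location of its copies remains available.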

A minimum requirement is that the data sent to the user client
terminal 560 includes a characteristic representation R of the
digitised entities in the hit list {H} and corresponding locators L,
e.g. a URL, for indicating at least one storage location in the data
collection 510. The latter gives the user at least a theoretical
possibility to retrieve full versions of the digitised entities. In
practice, however, the retrieval may be restricted in various
ways, for instance by means of copyright protection and
therefore require the purchase of the relevant rights.

The units 510 - 560 may either be physically separated from
each other or be co-located in arbitrary combination.

To sum up, a method of generating a searchable index
for digitised entities according to an embodiment of the
invention will now be described with reference to the flow diagram
in figure 6.

A first step 601 involves input of basic information that contains
one or more distinctive features being related to digitised
entities in a data collection. A following step 602 creates rank
parameters for each of the digitised entities on the basis of the input
information. Then, a step 603 generates a searchable index for
the rank parameters and finally, the searchable index is stored
in a searchable database in a step 604.
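
The four steps can be strung together in a compact sketch. It uses an in-memory SQLite database from Python's standard `sqlite3` module as a stand-in for the searchable database; the function names, the table layout, the sample counts and the simplified rank (occurrences divided by the largest count per keyword) are assumptions, not the specification's own implementation.

```python
import sqlite3


def input_basic_information():
    """Step 601: (entity, keyword, occurrences) triples; sample data only."""
    return [("img-1", "ferrari", 2), ("img-2", "ferrari", 3), ("img-3", "ferrari", 1)]


def create_rank_parameters(rows):
    """Step 602: rank = occurrences / largest count for the same keyword."""
    peak = {}
    for _, keyword, count in rows:
        peak[keyword] = max(peak.get(keyword, 0), count)
    return [(entity, keyword, count / peak[keyword]) for entity, keyword, count in rows]


def generate_index(ranked):
    """Step 603: order entries by keyword, best rank first."""
    return sorted(ranked, key=lambda row: (row[1], -row[2]))


def store_index(index, conn):
    """Step 604: store the index in a searchable database."""
    conn.execute("CREATE TABLE idx (entity TEXT, keyword TEXT, relevance REAL)")
    conn.executemany("INSERT INTO idx VALUES (?, ?, ?)", index)
    conn.commit()


conn = sqlite3.connect(":memory:")
store_index(generate_index(create_rank_parameters(input_basic_information())), conn)
best = conn.execute(
    "SELECT entity FROM idx WHERE keyword = ? ORDER BY relevance DESC LIMIT 1",
    ("ferrari",),
).fetchone()
```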

All of the process steps, as well as any sub-sequence of steps,
described with reference to figure 6 above may be controlled
by means of a computer program being directly loadable into the
internal memory of a computer, which includes appropriate
software for controlling the necessary steps when the program is
run on a computer. The computer program can likewise be
recorded onto any kind of computer readable medium.

The term "comprises/comprising" when used in this specification
is taken to specify the presence of stated features, integers,
steps or components. However, the term does not preclude the
presence or addition of one or more additional features,
integers, steps or components or groups thereof.

The invention is not restricted to the described embodiments in
the figures, but may be varied freely within the scope of the
claims.

Claims

1. A method of indexing digitised entities (E) in a data
collection (510) comprising:

    inputting basic information (ID_n, {K}, L) pertaining to at
least one distinctive feature ({K}) and at least one locator (L) for
each digitised entity (E) in a set of entities from the data
collection (510),

    generating searchable index information (I_E) related to the
digitised entities (E) in the set on the basis of the basic information
(ID_n, {K}, L), and

    storing the index information (I_E) in an index database
(530), characterised by

generating the index information (I_E) for a particular digitised
entity (E: ID_n) on the basis of at least one rank parameter (Δ(k_3),
Δ(k_5); Δ(k_19)) derived from the basic information (ID_n, {K}, L), the
at least one rank parameter (Δ(k_3), Δ(k_5); Δ(k_19)) being indicative
of a degree of relevance for at least one distinctive feature (k_3,
k_5; k_19) with respect to the digitised entity (E: ID_n).

2. A method according to claim 1, characterised by the at
least one rank parameter (Δ(k_3), Δ(k_5); Δ(k_19)) being based on a
first rank component (Γ) that is generated by a first algorithm,
which involves ranking individual distinctive features (k_1 - k_jd)
related to the digitised entity (E: n) on the basis of a relative
occurrence of the individual distinctive features (k_1 - k_jd) with
respect to one or more copies (c_a - c_d) of the digitised entity
(E: n) in the data collection (510).

3. A method according to claim 1, characterised by the first
algorithm involving the following steps, with respect to a
particular distinctive feature (k_3), for the digitised entity (E: n):

grouping at least one copy (c_a - c_d; c_11 - c_19, c_21 - c_24, c_31) of
at least the digitised entity (E: n; n_1, n_2, n_3) in a cluster (C_n),
each cluster (C_n; C_1, C_2, C_3) containing one or more copies of
the same digitised entity (E: n; n_1, n_2, n_3),

counting a total number of occurrences of the particular
distinctive feature (k_3) in each cluster (C_n; C_1, C_2, C_3), and

calculating a ratio between the total number of occurrences
of the particular distinctive feature (k_3) in the cluster (C_n)
for the digitised entity (E: n) and the total number of occurrences
of the particular distinctive feature (k_3) in a cluster (C_2)
which includes a largest number of the particular distinctive
feature (k_3).
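
The three steps of claim 3 (grouping copies into clusters, counting occurrences per cluster, and taking a ratio against the cluster with the largest count) can be illustrated with a minimal Python sketch. The data layout (a dictionary mapping an entity identifier to a list of copies, each copy being a list of keywords), the function names and the sample data are assumptions made for illustration only and do not appear in the claims.

```python
# Illustrative sketch only: the data layout and all names are assumptions.
# A cluster is modelled as the list of copies of one digitised entity,
# and each copy as the list of keywords found with that copy.

def count_in_cluster(copies, keyword):
    """Total number of occurrences of `keyword` over all copies in a cluster."""
    return sum(copy.count(keyword) for copy in copies)


def first_rank_component(clusters, entity_id, keyword):
    """Ratio between the keyword count in the entity's own cluster and the
    count in the cluster that contains the keyword most often."""
    totals = {eid: count_in_cluster(copies, keyword)
              for eid, copies in clusters.items()}
    largest = max(totals.values(), default=0)
    if largest == 0:
        # The keyword occurs in no cluster at all.
        return 0.0
    return totals[entity_id] / largest


# Hypothetical sample data: three entities with 3, 4 and 1 copies.
clusters = {
    "n1": [["car", "red"], ["car"], ["car", "volvo"]],
    "n2": [["car"], ["car"], ["car"], ["car", "blue"]],
    "n3": [["boat"]],
}

rank_n1 = first_rank_component(clusters, "n1", "car")  # 3 / 4
rank_n2 = first_rank_component(clusters, "n2", "car")  # 4 / 4
rank_n3 = first_rank_component(clusters, "n3", "car")  # 0 / 4
```

In this sketch the cluster with the most occurrences always receives the value 1, so every other cluster is scored relative to it.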



4. A method according to any one of the claims 1-3,
characterised by the at least one rank parameter (A(k_3), A(k_5);
A(k_19)) being based on a second rank component (Π) that is
generated by a second algorithm, which involves ranking at
least one individual distinctive feature (k_1, k_2, k_3) related to the
digitised entity (301) on basis of a position (P) of the at least one
individual distinctive feature (k_1, k_2, k_3) in a descriptive field (F)
associated with the digitised entity (E).

5. A method according to claim 4, characterised by
generating the second rank component (Π) on basis of a
particular weight factor (w_1 - w_p) being linked to each position
(1 - p) in the descriptive field (F), the weight factors (w_1 - w_p)
reflecting a distinctive feature's (k_1, k_2, k_3) significance with
respect to its position (P) in the descriptive field (F).

6. A method according to claim 5, characterised by
generating the second rank component (Π) on basis of a
relevancy parameter (s_1 - s_4) reflecting a distinctive feature's
(k_1) significance in relation to other distinctive features (k_2) in a
particular position (p) in the descriptive field (F).
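
Claims 4 to 6 name the inputs of the second rank component (a weight factor per position in the descriptive field and a relevancy parameter per feature within a position) but do not state how they are combined. The Python sketch below therefore assumes, purely for illustration, that the component is the sum over positions of the position weight multiplied by the relevancy parameter. The function name, the data layout and the numeric values are likewise assumptions.

```python
# Illustrative sketch only. ASSUMPTION: the second rank component is the sum
# of (position weight * relevancy parameter) over all positions in which the
# keyword appears. The claims do not prescribe this combination rule.

def second_rank_component(field, weights, keyword):
    """`field` is a list of positions; each position maps a keyword to its
    relevancy parameter within that position. `weights` holds one weight
    factor per position."""
    if len(field) != len(weights):
        raise ValueError("one weight factor is needed per position")
    score = 0.0
    for features, weight in zip(field, weights):
        relevancy = features.get(keyword)
        if relevancy is not None:
            score += weight * relevancy
    return score


# Hypothetical descriptive field with three positions and falling weights.
weights = [1.0, 0.5, 0.25]
field = [
    {"volvo": 1.0},              # position 1
    {"car": 0.75, "red": 0.25},  # position 2: two features share a position
    {"car": 0.5},                # position 3
]

score_car = second_rank_component(field, weights, "car")      # 0.375 + 0.125
score_volvo = second_rank_component(field, weights, "volvo")  # 1.0 * 1.0
```

With these assumed values a keyword in the first position outweighs one that only appears later in the field, which is the effect the weight factors are meant to capture.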

7. A method according to any one of the claims 4 - 6,
characterised by the generating of the rank parameter (A(k_3),
A(k_5); A(k_19)) involving a combination of the first rank component
(Γ) with the second rank component (Π).

8. A method according to claim 7, characterised by
combining the first rank component (Γ) with the second rank
component (Π) according to the expression:

where Γ represents the first rank component, Π represents the
second rank component, α represents a first merge factor and β
represents a second merge factor.

9. A method according to any one of the preceding claims,
characterised by the digitised entities (E) including at least one
of the file types: a text document, an image, a video sequence
and an audio sequence.

10. A method according to claim 9, characterised by at least
one of the digitised entities (E) constituting a sampled
representation of an analogue signal.

11. A method according to claim 9, characterised by at least
one of the digitised entities (E) constituting a computer
generated entity.

12. A method according to any one of the preceding claims,
characterised by the distinctive feature (k_1 - k_jd) being a
keyword.

13. A computer program directly loadable into the internal
memory of a digital computer, comprising software for
performing the steps of any of the claims 1-12 when said
program is run on a computer.

14. A computer readable medium, having a program recorded
thereon, where the program is to make a computer perform the
steps of any of the claims 1-12.

15. A database for storing index information (I_E) relating to
digitised entities (E), which have been generated according to
any one of the claims 1-12.

16. A server/client system for searching for digitised entities
(E) in a data collection (510) comprising

an indexing input device (520) for collecting basic
information (ID_n, {K}, L) pertaining to at least one distinctive
feature ({K}) and at least one locator (L) for each digitised entity
(E) in a set of entities from the data collection (510),

an index database (530) for storing index information (I_E)
relating to the digitised entities (E) in the set,

a search engine (540) for receiving search directives (S)
and in response thereto performing searches (S') in the index
database (530), and

a user client interface (550) for receiving a search request
(Q) from at least one user client terminal (560), forwarding the
search request (Q) as a search directive (S) to the search
engine (540), receiving a hit list ({H}) of digitised entities (E)
and returning a result (H(R, L)) of a corresponding search (S')
in the index database (530) to the at least one user client
terminal (560), characterised in that

the index database (530) is organised such that the index
information (I_E) for a particular digitised entity (E: ID_n) comprises
at least one rank parameter (A(k_3), A(k_5); A(k_19)), which is
indicative of a degree of relevance for at least one distinctive
feature (k_3, k_5; k_19) with respect to that digitised entity (E: ID_n).

17. A server/client system according to claim 16,
characterised in that the indexing input device (520) includes
an index generator (522) for receiving the basic information (ID_n,
{K}, L) and producing in response thereto the at least one rank
parameter (A(k_3), A(k_5); A(k_19)).

18. A server/client system according to any one of the claims
16 or 17, characterised in that the rank parameter (A(k_3), A(k_5);
A(k_19)) includes a first rank component (Γ) indicating a ranking
of at least one individual distinctive feature (k_1 - k_jd) related to
the digitised entity (E: n) on basis of a relative occurrence of the
at least one individual distinctive feature (k_1 - k_jd) with respect
to one or more copies (c_a - c_d) of the digitised entity (E: n) in
the data collection (510).

19. A server/client system according to any one of the claims
16-18, characterised in that the rank parameter (A(k_3), A(k_5);
A(k_19)) includes a second rank component (Π) indicating a
ranking of at least one individual distinctive feature (k_1, k_2)
related to the digitised entity (301) on basis of

a position (P) of the at least one individual distinctive feature
(k_1, k_2) in a descriptive field (F) associated with the digitised
entity (E), and

a relevancy parameter (s_1 - s_4) reflecting a distinctive
feature's (k_1) significance in relation to other distinctive features
(k_2) in a particular position (p) in the descriptive field (F).

20. A server/client system according to any one of the claims
16-19, characterised in that the indexing input device (520)
includes an automatic data collector (521) for finding relevant
digitised entities (E) in the data collection (510) and creating
therefrom the set of entities.



WO 02/073463 PCT/SE02/00462



21. A server/client system according to any one of the claims
16-20, characterised in that each of the digitised entities (E) in
the hit list ({H}) is associated with an identifier (ID_1 - ID_m), at
least one rank parameter (A(k_2), A(k_5); A(k_6) - A(k_5); A(k_12)) and
at least one locator (l_1 - l_m) for indicating a storage location in
the data collection (510).
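
The hit list of claim 21, in which each entity carries an identifier, a rank parameter and a locator, can be pictured with a small Python sketch. The in-memory dictionary standing in for the index database, the `Hit` type, the `search` function, the descending sort order and the example locators are all assumptions made for illustration and are not taken from the claims.

```python
# Illustrative sketch only: a dictionary stands in for the index database,
# and every name and sample value below is hypothetical.
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    identifier: str  # identifier of the digitised entity
    rank: float      # rank parameter for the searched keyword
    locator: str     # storage location in the data collection


def search(index, keyword):
    """Return the hit list for `keyword`, highest rank parameter first.
    ASSUMPTION: hits are ordered by descending rank parameter."""
    entries = index.get(keyword, [])
    hits = [Hit(identifier, rank, locator)
            for identifier, rank, locator in entries]
    return sorted(hits, key=lambda hit: hit.rank, reverse=True)


# Hypothetical index: keyword -> (identifier, rank parameter, locator).
index = {
    "car": [
        ("ID1", 0.40, "http://example.com/a.jpg"),
        ("ID2", 0.90, "http://example.com/b.jpg"),
        ("ID3", 0.65, "http://example.com/c.mpg"),
    ],
}

hit_list = search(index, "car")
ordered_ids = [hit.identifier for hit in hit_list]
```

Because the rank parameter is stored per keyword and per entity, a text query can order non-text entities such as images or video sequences without inspecting their content.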

22. A server/client system according to claim 21,
characterised in that each of the digitised entities (E) in the hit
list ({H}) is also associated with an illustrative element (r_1 - r_m)
to be displayed on the user client terminal (560) together with
the respective digitised entity (E).

23. A server/client system according to claim 22,
characterised in that the illustrative element (r_1 - r_m) is a
thumbnail picture.

24. A server/client system according to any one of the claims
20-23, characterised in that the data collection (510) is an
internetwork and the indexing input device (520) includes a web
crawler (522).

25. A server/client system according to any one of the claims
16-24, characterised in that the digitised entities (E) include at
least one of the file types: a text document, an image, a video
sequence and an audio sequence.

26. A server/client system according to claim 25,
characterised in that the digitised entities (E) are stored in at
least one of the formats: AIF, AIFC, AIFF, AU, AVI, BMP, DIVX,
DOC, EPS, GIF, ICO, JPEG, JPG, MOV, MP3, MP4, MPEG,
MPEG4, MPG, PDF, PNG, PPT, PS, QT, RA, RAM, RAS, SND,
TIF, TIFF, VCD, WAV, XLS and XMP.